• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5
continuous attributes
• DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon
• Attribute Information:
PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models on them to predict ‘mpg’
1.Import and warehouse data: [ Score: 3 points ]
• Import all the given datasets and explore shape and size.
• Merge all datasets into one and explore the final shape and size.
• Export the final dataset and store it on the local machine in .csv, .xlsx and .json format for future use.
• Import the data from the above steps into Python.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Import Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#import pandas_profiling
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
from sklearn.preprocessing import StandardScaler
# Import First dataset and check shape
df_CarName = pd.read_csv("/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Car name.csv")
df_CarName.sample(5)
df_CarName.shape
| car_name | |
|---|---|
| 95 | buick electra 225 custom |
| 283 | amc concord dl 6 |
| 307 | oldsmobile omega brougham |
| 71 | mazda rx2 coupe |
| 334 | mazda rx-7 gs |
(398, 1)
# Import First dataset and check shape
df_CarAtr = pd.read_json("/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Car-Attributes.json")
df_CarAtr.sample(5)
df_CarAtr.shape
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 76 | 18.0 | 4 | 121.0 | 112 | 2933 | 14.5 | 72 | 2 |
| 357 | 32.9 | 4 | 119.0 | 100 | 2615 | 14.8 | 81 | 3 |
| 74 | 13.0 | 8 | 302.0 | 140 | 4294 | 16.0 | 72 | 1 |
| 312 | 37.2 | 4 | 86.0 | 65 | 2019 | 16.4 | 80 | 3 |
| 214 | 13.0 | 8 | 302.0 | 130 | 3870 | 15.0 | 76 | 1 |
(398, 8)
# Combine two Dataframes
frames = [df_CarName,df_CarAtr]
df_mpgRaw = pd.concat(frames,axis=1,verify_integrity=True)
df_mpgRaw.sample(5)
df_mpgRaw.shape
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|---|
| 341 | chevrolet citation | 23.5 | 6 | 173.0 | 110 | 2725 | 12.6 | 81 | 1 |
| 242 | bmw 320i | 21.5 | 4 | 121.0 | 110 | 2600 | 12.8 | 77 | 2 |
| 122 | saab 99le | 24.0 | 4 | 121.0 | 110 | 2660 | 14.0 | 73 | 2 |
| 26 | chevy c20 | 10.0 | 8 | 307.0 | 200 | 4376 | 15.0 | 70 | 1 |
| 299 | peugeot 504 | 27.2 | 4 | 141.0 | 71 | 3190 | 24.8 | 79 | 2 |
(398, 9)
# Save the raw dataset as CSV, Excel and JSON for future use
df_mpgRaw.to_csv('Cardata.csv', index=False)
# df_mpgRaw.to_excel('Cardata.xlsx', index=False)  # needs openpyxl; the .xls format is deprecated
df_mpgRaw.to_json('Cardata.json')
# Import data from the files saved above
mpgRaw_csv = pd.read_csv("Cardata.csv")
# mpgRaw_excel = pd.read_excel("Cardata.xlsx")
mpgRaw_json = pd.read_json("Cardata.json")
# Check info of data
mpgRaw_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   car_name  398 non-null    object
 1   mpg       398 non-null    float64
 2   cyl       398 non-null    int64
 3   disp      398 non-null    float64
 4   hp        398 non-null    object
 5   wt        398 non-null    int64
 6   acc       398 non-null    float64
 7   yr        398 non-null    int64
 8   origin    398 non-null    int64
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
# Checking standard null values in data
mpgRaw_csv.isnull().sum()
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64
print(mpgRaw_csv.iloc[330])
car_name    renault lecar deluxe
mpg                         40.9
cyl                            4
disp                        85.0
hp                             ?
wt                          1835
acc                         17.3
yr                            80
origin                         2
Name: 330, dtype: object
# Change hp datatype to float and create the final dataframe
CarForMpg = mpgRaw_csv.copy()
CarForMpg.shape
CarForMpg['hp'] = CarForMpg['hp'].apply(pd.to_numeric, errors="coerce")
# Checking null values
CarForMpg["hp"][CarForMpg["hp"].isnull()]
(398, 9)
32     NaN
126    NaN
330    NaN
336    NaN
354    NaN
374    NaN
Name: hp, dtype: float64
print(CarForMpg["hp"].median())
93.5
# Replace null values with the column medians (numeric_only skips the car_name column)
CarForMpg.fillna(CarForMpg.median(numeric_only=True), inplace=True)
# Describe the numerical data
CarForMpg.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| mpg | 398.0 | 23.514573 | 7.815984 | 9.0 | 17.500 | 23.0 | 29.000 | 46.6 |
| cyl | 398.0 | 5.454774 | 1.701004 | 3.0 | 4.000 | 4.0 | 8.000 | 8.0 |
| disp | 398.0 | 193.425879 | 104.269838 | 68.0 | 104.250 | 148.5 | 262.000 | 455.0 |
| hp | 398.0 | 104.304020 | 38.222625 | 46.0 | 76.000 | 93.5 | 125.000 | 230.0 |
| wt | 398.0 | 2970.424623 | 846.841774 | 1613.0 | 2223.750 | 2803.5 | 3608.000 | 5140.0 |
| acc | 398.0 | 15.568090 | 2.757689 | 8.0 | 13.825 | 15.5 | 17.175 | 24.8 |
| yr | 398.0 | 76.010050 | 3.697627 | 70.0 | 73.000 | 76.0 | 79.000 | 82.0 |
| origin | 398.0 | 1.572864 | 0.802055 | 1.0 | 1.000 | 1.0 | 2.000 | 3.0 |
CarForMpg.dtypes
car_name     object
mpg         float64
cyl           int64
disp        float64
hp          float64
wt            int64
acc         float64
yr            int64
origin        int64
dtype: object
sns.pairplot(CarForMpg);
# Scaling the data before K Means
from scipy.stats import zscore
CarForMpg_model = CarForMpg.drop("car_name", axis=1).apply(zscore)
CarForMpg_model.head()
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 | -0.715145 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 | -0.715145 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 | -0.715145 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 | -0.715145 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 | -0.715145 |
CarForMpg_model["hp"].describe()
count    3.980000e+02
mean    -7.141133e-17
std      1.001259e+00
min     -1.527300e+00
25%     -7.414364e-01
50%     -2.830161e-01
75%      5.421404e-01
max      3.292662e+00
Name: hp, dtype: float64
# Finding the optimal number of clusters
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

clusters = range(1, 20)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(CarForMpg_model)
    # average distance of each point to its nearest cluster centre
    meanDistortions.append(sum(np.min(cdist(CarForMpg_model, model.cluster_centers_, 'euclidean'), axis=1)) / CarForMpg_model.shape[0])
plt.plot(clusters, meanDistortions, 'bx-');
plt.xlabel('k');
plt.ylabel('Average distortion');
plt.title('Selecting k with the Elbow Method');
It can be seen that the slope changes (an elbow exists) at a cluster size of 5 (k = 5); selecting that for further study.
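As a cross-check on the elbow reading, silhouette scores can be computed over a range of k. This is only a minimal sketch on a small synthetic stand-in matrix; the same loop applies to the scaled `CarForMpg_model` dataframe from the cells above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs as a stand-in for the scaled feature matrix
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

# Silhouette score for each k; higher means better-separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the k with the highest silhouette
```

Where the elbow is ambiguous, agreement between the two methods gives more confidence in the chosen k.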
# Fit the final model with K = 5
final_model = KMeans(n_clusters=5)
final_model.fit(CarForMpg_model)
prediction = final_model.predict(CarForMpg_model)
# Append the cluster labels
CarForMpg["GROUP"] = prediction
CarForMpg_model["GROUP"] = prediction
print("Groups Assigned : \n")
CarForMpg.sample(5)
KMeans(n_clusters=5)
Groups Assigned :
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | GROUP | |
|---|---|---|---|---|---|---|---|---|---|---|
| 282 | ford fairmont 4 | 22.3 | 4 | 140.0 | 88.0 | 2890 | 17.3 | 79 | 1 | 1 |
| 251 | mercury monarch ghia | 20.2 | 8 | 302.0 | 139.0 | 3570 | 12.8 | 78 | 1 | 0 |
| 359 | peugeot 505s turbo diesel | 28.1 | 4 | 141.0 | 80.0 | 3230 | 20.4 | 81 | 2 | 1 |
| 260 | dodge aspen | 18.6 | 6 | 225.0 | 110.0 | 3620 | 18.7 | 78 | 1 | 3 |
| 235 | toyota corolla liftback | 26.0 | 4 | 97.0 | 75.0 | 2265 | 18.2 | 77 | 3 | 2 |
CarForMpgClust = CarForMpg.groupby(['GROUP'])
CarForMpgClust.mean()
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| GROUP | ||||||||
| 0 | 14.429787 | 8.000000 | 350.042553 | 162.393617 | 4157.978723 | 12.576596 | 73.468085 | 1.000000 |
| 1 | 28.791045 | 4.194030 | 132.567164 | 82.865672 | 2563.805970 | 16.549254 | 79.671642 | 1.074627 |
| 2 | 34.137500 | 4.083333 | 99.527778 | 72.875000 | 2155.819444 | 16.734722 | 79.416667 | 2.763889 |
| 3 | 19.104938 | 6.222222 | 233.444444 | 101.882716 | 3298.580247 | 16.632099 | 75.703704 | 1.037037 |
| 4 | 24.619048 | 4.047619 | 108.601190 | 85.672619 | 2347.619048 | 16.107143 | 73.309524 | 2.107143 |
CarForMpg_model.boxplot(by='GROUP', layout = (2,4),figsize=(15,10));
#### Generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
CarForMpg_model = CarForMpg_model.drop("GROUP",axis=1)
Z = linkage(CarForMpg_model, 'ward', metric='euclidean')
Z.shape
(397, 4)
# Plot Dendrogram
plt.figure(figsize=(25, 10));
dendrogram(Z);
plt.show();
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=10, # show only the last p merged clusters
);
plt.show();
# Cut the dendrogram at a maximum distance and form flat clusters
max_d = 5
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
array([20, 20, 20, 20, 20, 19, 19, 19, 19, 20, 20, 20, 20, 19, 3, 15, 15,
15, 3, 2, 2, 4, 2, 4, 15, 22, 22, 22, 22, 3, 1, 3, 2, 15,
15, 15, 15, 15, 22, 22, 21, 21, 22, 22, 22, 15, 2, 15, 15, 1, 4,
2, 4, 13, 13, 2, 2, 3, 1, 2, 2, 1, 22, 22, 21, 21, 20, 19,
22, 22, 22, 3, 21, 21, 21, 21, 6, 2, 2, 2, 1, 3, 3, 1, 3,
22, 20, 21, 21, 21, 19, 22, 22, 21, 19, 19, 20, 15, 16, 15, 15, 15,
2, 22, 22, 22, 22, 15, 3, 2, 3, 3, 2, 15, 4, 21, 19, 2, 4,
6, 6, 20, 6, 6, 20, 15, 15, 15, 16, 13, 1, 13, 1, 16, 16, 16,
21, 21, 21, 21, 21, 4, 4, 4, 13, 13, 1, 4, 4, 3, 3, 4, 16,
16, 16, 16, 22, 21, 21, 21, 16, 16, 16, 16, 17, 18, 18, 13, 1, 15,
1, 5, 4, 3, 15, 10, 16, 6, 6, 6, 6, 13, 4, 4, 1, 1, 4,
18, 18, 18, 18, 17, 17, 17, 17, 2, 2, 10, 13, 17, 16, 16, 16, 10,
13, 13, 1, 6, 18, 12, 6, 6, 22, 18, 18, 18, 13, 9, 10, 1, 13,
18, 16, 18, 18, 16, 17, 17, 17, 22, 22, 22, 18, 10, 1, 13, 1, 9,
9, 13, 10, 6, 6, 5, 11, 9, 13, 14, 14, 18, 18, 18, 17, 17, 17,
1, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 9, 5, 5, 9, 5, 1,
1, 5, 6, 6, 6, 6, 10, 13, 17, 17, 1, 17, 17, 18, 18, 18, 18,
18, 18, 18, 18, 10, 14, 9, 7, 12, 17, 12, 17, 9, 9, 13, 10, 7,
8, 8, 8, 10, 14, 9, 14, 7, 7, 7, 17, 10, 14, 14, 14, 14, 14,
7, 14, 11, 11, 12, 12, 14, 10, 14, 10, 5, 5, 10, 7, 14, 7, 7,
7, 8, 8, 14, 9, 14, 14, 14, 14, 14, 9, 9, 7, 10, 10, 14, 14,
14, 14, 12, 12, 5, 5, 17, 17, 17, 17, 7, 7, 7, 7, 7, 7, 7,
7, 10, 14, 14, 9, 9, 14, 14, 14, 14, 14, 14, 17, 7, 7, 17, 14,
8, 7, 7, 11, 8, 7, 7], dtype=int32)
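A distance cut of 5 yields 22 flat clusters here. If a fixed number of clusters is preferred instead, `fcluster` also accepts `criterion='maxclust'`. A small sketch on synthetic 2-D data; the real linkage matrix `Z` above would be used the same way:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic groups standing in for the scaled car data
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

Z = linkage(X, 'ward', metric='euclidean')
# criterion='maxclust' caps the number of flat clusters directly,
# instead of cutting the tree at a distance threshold
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))
```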
# Visualising the clusters on a synthetic 2-D dataset
# (the real data is 8-dimensional, so a 2-D stand-in of the same size is generated)
# plt.figure(figsize=(10, 8))
np.random.seed(101)  # for repeatability of this dataset
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[150,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[98,])
c = np.random.multivariate_normal([10, 20], [[3, 1], [1, 4]], size=[150,])
X = np.concatenate((a, b, c), axis=0)
print(X.shape)  # 398 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1], c=clusters)  # colour points by the hierarchical cluster labels
plt.show()
(398, 2)
CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.
DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.
Attribute Information:
PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.
Steps and tasks: [ Total Score: 5 points]
# Import First dataset and check shape
df_WineQuality = pd.read_excel("Part2 - Company.xlsx")
df_WineQuality.head(5)
df_WineQuality.shape
df_WineQuality["Quality"].unique()
| A | B | C | D | Quality | |
|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality A |
| 1 | 174 | 133 | 134 | 166 | Quality B |
| 2 | 159 | 163 | 135 | 131 | NaN |
| 3 | 61 | 23 | 3 | 44 | Quality A |
| 4 | 59 | 60 | 9 | 68 | Quality A |
(61, 5)
array(['Quality A', 'Quality B', nan], dtype=object)
# Checking standard null values in data
df_WineQuality.isnull().sum()
A           0
B           0
C           0
D           0
Quality    18
dtype: int64
# Apply a label encoder to every column
# (note: this rank-encodes the numeric columns too, and the NaNs in Quality
#  become their own class, 2)
from sklearn.preprocessing import LabelEncoder
_encoder = LabelEncoder()
_cols = df_WineQuality.columns
df_WineQuality[_cols] = df_WineQuality[_cols].apply(_encoder.fit_transform)
df_WineQuality["Quality"].unique()
array([0, 1, 2])
df_WineQuality.head()
| A | B | C | D | Quality | |
|---|---|---|---|---|---|
| 0 | 10 | 5 | 10 | 24 | 0 |
| 1 | 41 | 26 | 25 | 39 | 1 |
| 2 | 34 | 39 | 26 | 27 | 2 |
| 3 | 13 | 4 | 0 | 11 | 0 |
| 4 | 12 | 12 | 3 | 17 | 0 |
df_WineQuality["Quality"] = df_WineQuality["Quality"].replace(2,np.nan)
# Checking standard null values in data
df_WineQuality.isnull().sum()
A           0
B           0
C           0
D           0
Quality    18
dtype: int64
from sklearn.impute import KNNImputer

# Impute the missing Quality labels from the 2 nearest neighbours
# and keep the result (fit_transform returns a NumPy array)
imputer = KNNImputer(n_neighbors=2)
df_WineQuality = pd.DataFrame(imputer.fit_transform(df_WineQuality),
                              columns=df_WineQuality.columns)
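With the labels imputed, one simple route to the stated synthetic-data objective is to fit a per-class multivariate normal and sample new rows from it. This is an illustrative sketch on stand-in data with similar column names, not the company's actual generation model:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the imputed wine table: numeric features plus a Quality label
df = pd.DataFrame({"A": rng.normal(50, 10, 60),
                   "B": rng.normal(30, 5, 60),
                   "Quality": rng.integers(0, 2, 60)})

parts = []
for q, grp in df.groupby("Quality"):
    feats = grp.drop(columns="Quality")
    # Fit a multivariate normal to each quality class and draw new rows from it
    draws = rng.multivariate_normal(feats.mean(), np.cov(feats, rowvar=False), size=20)
    part = pd.DataFrame(draws, columns=feats.columns)
    part["Quality"] = q
    parts.append(part)

synthetic = pd.concat(parts, ignore_index=True)
print(synthetic.shape)
```

Sampling per class preserves each quality level's mean vector and covariance structure, which is usually the minimum requirement for useful synthetic tabular data.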
CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette.
The vehicle may be viewed from one of many different angles.
DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles
were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of
vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but it would be more
difficult to distinguish between the cars.
All the features are numeric i.e. geometric features extracted from the silhouette.
PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.
Steps and tasks: [ Total Score: 20 points]
Step 1: Data: Import, clean and pre-process the data
# Import First dataset and check shape
df_Auto = pd.read_csv("Part3 - vehicle.csv")
df_Auto.sample(5)
df_Auto.shape
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | 90 | 36.0 | 78.0 | 179.0 | 64.0 | 8 | 157.0 | 42.0 | 19.0 | 126 | 182.0 | 367.0 | 142.0 | 66.0 | 1.0 | 20.0 | 192.0 | 198 | car |
| 74 | 89 | 42.0 | 89.0 | 147.0 | 61.0 | 11 | 151.0 | 44.0 | 19.0 | 145 | 170.0 | 338.0 | 163.0 | 72.0 | 11.0 | 23.0 | 187.0 | 199 | van |
| 548 | 94 | 39.0 | 75.0 | 184.0 | 72.0 | 8 | 155.0 | 42.0 | 19.0 | 133 | 175.0 | 365.0 | 145.0 | 70.0 | 4.0 | 5.0 | 192.0 | 200 | bus |
| 822 | 95 | 41.0 | 82.0 | 170.0 | 65.0 | 9 | 145.0 | 46.0 | 19.0 | 145 | 163.0 | 314.0 | 140.0 | 64.0 | 4.0 | 8.0 | 199.0 | 207 | van |
| 235 | 90 | 48.0 | 78.0 | 134.0 | 56.0 | 11 | 160.0 | 43.0 | 20.0 | 167 | 169.0 | 366.0 | 185.0 | 76.0 | 1.0 | 14.0 | 182.0 | 192 | van |
(846, 19)
# Check info of data
df_Auto.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   compactness                  846 non-null    int64
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64
 18  class                        846 non-null    object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
# Checking standard null values in data
df_Auto.isnull().sum()
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
# Imputing null values with the column medians (numeric_only skips the class column)
df_Auto = df_Auto.fillna(df_Auto.median(numeric_only=True))
Step 2. EDA and visualisation: Create a detailed performance report using univariate, bivariate and multivariate EDA techniques. Find all possible hidden patterns using every applicable method. Use your best analytical approach to build this report: you can mix and match columns to create new ones for better analysis, and create your own features if required. Be highly experimental and analytical here to find hidden patterns.
a) Univariate and Bivariate Analysis
# Describe the numerical data
df_Auto.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| compactness | 846.0 | 93.678487 | 8.234474 | 73.0 | 87.00 | 93.0 | 100.00 | 119.0 |
| circularity | 846.0 | 44.823877 | 6.134272 | 33.0 | 40.00 | 44.0 | 49.00 | 59.0 |
| distance_circularity | 846.0 | 82.100473 | 15.741569 | 40.0 | 70.00 | 80.0 | 98.00 | 112.0 |
| radius_ratio | 846.0 | 168.874704 | 33.401356 | 104.0 | 141.00 | 167.0 | 195.00 | 333.0 |
| pr.axis_aspect_ratio | 846.0 | 61.677305 | 7.882188 | 47.0 | 57.00 | 61.0 | 65.00 | 138.0 |
| max.length_aspect_ratio | 846.0 | 8.567376 | 4.601217 | 2.0 | 7.00 | 8.0 | 10.00 | 55.0 |
| scatter_ratio | 846.0 | 168.887707 | 33.197710 | 112.0 | 147.00 | 157.0 | 198.00 | 265.0 |
| elongatedness | 846.0 | 40.936170 | 7.811882 | 26.0 | 33.00 | 43.0 | 46.00 | 61.0 |
| pr.axis_rectangularity | 846.0 | 20.580378 | 2.588558 | 17.0 | 19.00 | 20.0 | 23.00 | 29.0 |
| max.length_rectangularity | 846.0 | 147.998818 | 14.515652 | 118.0 | 137.00 | 146.0 | 159.00 | 188.0 |
| scaled_variance | 846.0 | 188.596927 | 31.360427 | 130.0 | 167.00 | 179.0 | 217.00 | 320.0 |
| scaled_variance.1 | 846.0 | 439.314421 | 176.496341 | 184.0 | 318.25 | 363.5 | 586.75 | 1018.0 |
| scaled_radius_of_gyration | 846.0 | 174.706856 | 32.546277 | 109.0 | 149.00 | 173.5 | 198.00 | 268.0 |
| scaled_radius_of_gyration.1 | 846.0 | 72.443262 | 7.468734 | 59.0 | 67.00 | 71.5 | 75.00 | 135.0 |
| skewness_about | 846.0 | 6.361702 | 4.903244 | 0.0 | 2.00 | 6.0 | 9.00 | 22.0 |
| skewness_about.1 | 846.0 | 12.600473 | 8.930962 | 0.0 | 5.00 | 11.0 | 19.00 | 41.0 |
| skewness_about.2 | 846.0 | 188.918440 | 6.152247 | 176.0 | 184.00 | 188.0 | 193.00 | 206.0 |
| hollows_ratio | 846.0 | 195.632388 | 7.438797 | 181.0 | 190.25 | 197.0 | 201.00 | 211.0 |
# Pairplot for univariate and Bivariate analysis
sns.pairplot(df_Auto, hue="class", diag_kind='kde');
Steps 3 and 4. SVM with and without dimensionality reduction: perform dimensionality reduction on the data.
# Standardizing the data for SVM and PCA both
from scipy.stats import zscore
df_Auto_Scaled = df_Auto.drop('class', axis=1)
df_Auto_Scaled=df_Auto_Scaled.apply(zscore)
df_Auto_Scaled.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.160580 | 0.518073 | 0.057177 | 0.273363 | 1.310398 | 0.311542 | -0.207598 | 0.136262 | -0.224342 | 0.758332 | -0.401920 | -0.341934 | 0.285705 | -0.327326 | -0.073812 | 0.380870 | -0.312012 | 0.183957 |
| 1 | -0.325470 | -0.623732 | 0.120741 | -0.835032 | -0.593753 | 0.094079 | -0.599423 | 0.520519 | -0.610886 | -0.344578 | -0.593357 | -0.619724 | -0.513630 | -0.059384 | 0.538390 | 0.156798 | 0.013265 | 0.452977 |
| 2 | 1.254193 | 0.844303 | 1.519141 | 1.202018 | 0.548738 | 0.311542 | 1.148719 | -1.144597 | 0.935290 | 0.689401 | 1.097671 | 1.109379 | 1.392477 | 0.074587 | 1.558727 | -0.403383 | -0.149374 | 0.049447 |
| 3 | -0.082445 | -0.623732 | -0.006386 | -0.295813 | 0.167907 | 0.094079 | -0.750125 | 0.648605 | -0.610886 | -0.344578 | -0.912419 | -0.738777 | -1.466683 | -1.265121 | -0.073812 | -0.291347 | 1.639649 | 1.529056 |
| 4 | -1.054545 | -0.134387 | -0.769150 | 1.082192 | 5.245643 | 9.444962 | -0.599423 | 0.520519 | -0.610886 | -0.275646 | 1.671982 | -0.648070 | 0.408680 | 7.309005 | 0.538390 | -0.179311 | -1.450481 | -1.699181 |
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(df_Auto_Scaled)
PCA(n_components=18)
# Eigenvalues (explained variance per component)
print(pca.explained_variance_)
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02 3.57947189e-02 2.74120657e-02 2.05792871e-02 1.79166314e-02 1.00257898e-02 2.96445743e-03]
# Explained variance ratio of each component
print(pca.explained_variance_ratio_)
[5.21860337e-01 1.67297684e-01 1.05626388e-01 6.54745969e-02 5.08986889e-02 2.99641300e-02 1.99136623e-02 1.23150069e-02 8.91215289e-03 5.09714695e-03 3.69004485e-03 2.58586200e-03 1.98624491e-03 1.52109243e-03 1.14194232e-03 9.94191854e-04 5.56329946e-04 1.64497408e-04]
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center');
plt.ylabel('Variation explained');
plt.xlabel('Principal component');
plt.show();
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid');
plt.ylabel('Cumulative variation explained');
plt.xlabel('Principal component');
plt.show();
It can be seen that 10 dimensions are sufficient to explain almost 98% of the variability in the data, so 10 principal components are selected.
pca10 = PCA(n_components=10)
pca10.fit(df_Auto_Scaled)
Xpca10 = pca10.transform(df_Auto_Scaled)
Xpca10.shape
PCA(n_components=10)
(846, 10)
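One way to sanity-check the 10-component cut is the reconstruction error of `inverse_transform`. A sketch on a random stand-in matrix with the same 18-column shape as `df_Auto_Scaled` (on the real, highly correlated data the residual would be far smaller):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))  # stand-in for the scaled 18-column matrix

pca10 = PCA(n_components=10).fit(X)
X_reduced = pca10.transform(X)
# inverse_transform maps the 10 components back to the original 18 columns;
# the mean squared residual measures how much information the cut discards
X_back = pca10.inverse_transform(X_reduced)
mse = float(np.mean((X - X_back) ** 2))
print(X_back.shape, round(mse, 3))
```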
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, mean_absolute_error
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Transform data into features and target
X_org = df_Auto_Scaled
y = df_Auto['class'].astype('category')
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_org, y, test_size=0.2, random_state=7)
autosvm = SVC(gamma=0.025, C=3)
autosvm.fit(X_train , y_train)
y_pred = autosvm.predict(X_test)
SVC(C=3, gamma=0.025)
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
0.9764705882352941
# Performing grid search for the best hyperparameters
from sklearn.model_selection import GridSearchCV

# defining the parameter grid
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
# fitting the model for grid search
grid.fit(X_train, y_train);
# print best parameter after tuning
print(grid.best_params_)
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
Fitting 5 folds for each of 15 candidates, totalling 75 fits [CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time= 0.1s [CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time= 0.0s [CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time= 0.0s [CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time= 0.0s [CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.860 total time= 0.0s [CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.867 total time= 0.0s [CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.815 total time= 0.0s [CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time= 0.0s [CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.807 total time= 0.0s [CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.500 total time= 0.0s [CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time= 0.0s [CV 3/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time= 0.0s [CV 4/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time= 0.0s [CV 1/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.500 total time= 0.0s [CV 2/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 3/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 4/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 1/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.500 total time= 0.0s [CV 2/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 3/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 4/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END ...C=0.1, gamma=0.0001, 
kernel=rbf;, score=0.511 total time= 0.0s [CV 1/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.779 total time= 0.0s [CV 2/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.800 total time= 0.0s [CV 3/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.815 total time= 0.0s [CV 4/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.741 total time= 0.0s [CV 5/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.793 total time= 0.0s [CV 1/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.985 total time= 0.0s [CV 2/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.956 total time= 0.0s [CV 3/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.963 total time= 0.0s [CV 4/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.948 total time= 0.0s [CV 5/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.956 total time= 0.0s [CV 1/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.963 total time= 0.0s [CV 2/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.941 total time= 0.0s [CV 3/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.919 total time= 0.0s [CV 4/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.933 total time= 0.0s [CV 5/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.904 total time= 0.0s [CV 1/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.500 total time= 0.0s [CV 2/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.519 total time= 0.0s [CV 3/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.526 total time= 0.0s [CV 4/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.511 total time= 0.0s [CV 1/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.500 total time= 0.0s [CV 2/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 3/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 4/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 
total time= 0.0s [CV 1/5] END .........C=10, gamma=1, kernel=rbf;, score=0.809 total time= 0.0s [CV 2/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time= 0.0s [CV 3/5] END .........C=10, gamma=1, kernel=rbf;, score=0.815 total time= 0.0s [CV 4/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time= 0.0s [CV 5/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time= 0.0s [CV 1/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.978 total time= 0.0s [CV 2/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.978 total time= 0.0s [CV 3/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.963 total time= 0.0s [CV 4/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.956 total time= 0.0s [CV 5/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.963 total time= 0.0s [CV 1/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.985 total time= 0.0s [CV 2/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.970 total time= 0.0s [CV 3/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.956 total time= 0.0s [CV 4/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.963 total time= 0.0s [CV 5/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.956 total time= 0.0s [CV 1/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.919 total time= 0.0s [CV 2/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.933 total time= 0.0s [CV 3/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.874 total time= 0.0s [CV 4/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.911 total time= 0.0s [CV 5/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.881 total time= 0.0s [CV 1/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.507 total time= 0.0s [CV 2/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.519 total time= 0.0s [CV 3/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.526 total time= 0.0s [CV 4/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s [CV 5/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.511 total time= 0.0s
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.1, 1, 10],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']},
verbose=3)
{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
SVC(C=10, gamma=0.1)
# Running Best model and Printing Accuracy
autosvm = SVC(gamma=0.1, C=10)
autosvm.fit(X_train , y_train)
y_pred = autosvm.predict(X_test)
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
SVC(C=10, gamma=0.1)
0.9823529411764705
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(Xpca10, y, test_size=0.2, random_state=7)
autosvm_pca = SVC(gamma=0.1, C=10)
autosvm_pca.fit(X_train , y_train)
y_pred = autosvm_pca.predict(X_test)
SVC(C=10, gamma=0.1)
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
0.9647058823529412
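A single train/test split can flatter either model, so cross-validation gives a steadier raw-vs-PCA comparison. A sketch on a synthetic stand-in classification set; the same calls apply to `df_Auto_Scaled` and `y`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the scaled vehicle features and class labels
X, y = make_classification(n_samples=300, n_features=18,
                           n_informative=8, random_state=7)

svc = SVC(gamma=0.1, C=10)
# 5-fold cross-validation averages out the luck of one train/test split
raw_scores = cross_val_score(svc, X, y, cv=5)
pca_scores = cross_val_score(svc, PCA(n_components=10).fit_transform(X), y, cv=5)
print(round(raw_scores.mean(), 3), round(pca_scores.mean(), 3))
```

If the PCA mean score stays within a fold's standard deviation of the raw score, the 10-component model achieves comparable accuracy at roughly half the input width.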
CONTEXT: Company X is a sports management company for international cricket.
DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far.
Attribute Information:
PROJECT OBJECTIVE: The goal is to build a data-driven batsman ranking model for the sports management company to support business decisions.
Steps and tasks: [ Total Score: 5 points]
# Import First dataset and check shape
df_BatsmRank = pd.read_csv("Part4 - batting_bowling_ipl_bat.csv")
df_BatsmRank.sample(5)
df_BatsmRank.shape
| Name | Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|---|
| 144 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 143 | MJ Clarke | 98.0 | 16.33 | 104.25 | 12.0 | 0.0 | 0.0 |
| 83 | IK Pathan | 176.0 | 25.14 | 139.68 | 14.0 | 6.0 | 0.0 |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 |
| 104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
(180, 7)
# Checking type of Data
df_BatsmRank.dtypes
Name      object
Runs     float64
Ave      float64
SR       float64
Fours    float64
Sixes    float64
HF       float64
dtype: object
# Checking standard null values in data
df_BatsmRank.isnull()
| Name | Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True |
| 1 | False | False | False | False | False | False | False |
| 2 | True | True | True | True | True | True | True |
| 3 | False | False | False | False | False | False | False |
| 4 | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 175 | False | False | False | False | False | False | False |
| 176 | True | True | True | True | True | True | True |
| 177 | False | False | False | False | False | False | False |
| 178 | True | True | True | True | True | True | True |
| 179 | False | False | False | False | False | False | False |
180 rows × 7 columns
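Printing the full Boolean frame from `isnull()` is hard to read; summing it per column is usually more informative. A small sketch on a toy frame shaped like the batting data:

```python
# Count nulls per column and in total instead of eyeballing the Boolean frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['CH Gayle', np.nan, 'V Sehwag'],
                   'Runs': [733.0, np.nan, 495.0]})
print(df.isnull().sum())        # nulls per column
print(df.isnull().sum().sum())  # total nulls in the frame
```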
df_BatsmRank_pca = df_BatsmRank.dropna()
Names = df_BatsmRank_pca["Name"]
Names.reset_index(inplace = True, drop = True)
df_BatsmRank_pca.head(5)
| Name | Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|---|
| 1 | CH Gayle | 733.0 | 61.08 | 160.74 | 46.0 | 59.0 | 9.0 |
| 3 | G Gambhir | 590.0 | 36.87 | 143.55 | 64.0 | 17.0 | 6.0 |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 |
| 7 | CL White | 479.0 | 43.54 | 149.68 | 41.0 | 20.0 | 5.0 |
| 9 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.0 | 18.0 | 5.0 |
from scipy.stats import zscore
df_BatsmRank_pca = df_BatsmRank_pca.drop('Name', axis=1)
df_BatsmRank_pca=df_BatsmRank_pca.apply(zscore)
df_BatsmRank_pca.head()
| Runs | Ave | SR | Fours | Sixes | HF | |
|---|---|---|---|---|---|---|
| 1 | 3.301945 | 2.683984 | 1.767325 | 1.607207 | 6.462679 | 4.651551 |
| 3 | 2.381639 | 0.896390 | 1.036605 | 2.710928 | 1.184173 | 2.865038 |
| 5 | 1.770248 | 0.610640 | 1.788154 | 2.281703 | 1.435530 | 2.269533 |
| 7 | 1.667276 | 1.388883 | 1.297182 | 1.300618 | 1.561209 | 2.269533 |
| 9 | 2.246490 | 1.174755 | 0.444038 | 2.343021 | 1.309851 | 2.269533 |
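`scipy.stats.zscore` standardizes each column to mean 0 and standard deviation 1 (using the population standard deviation, `ddof=0`, by default). A quick sanity check on random data:

```python
# zscore transforms each column to (x - mean) / std, so after the transform
# every column has mean ~0 and std ~1.
import numpy as np
from scipy.stats import zscore

data = np.random.default_rng(0).normal(loc=5, scale=3, size=(90, 3))
z = zscore(data)
print(np.allclose(z.mean(axis=0), 0, atol=1e-10))  # column means are ~0
print(np.allclose(z.std(axis=0), 1))               # column stds are ~1
```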
covMatrix = np.cov(df_BatsmRank_pca,rowvar=False)
print(covMatrix)
[[1.01123596 0.70077082 0.49903347 0.9291323  0.77842677 0.84453142]
 [0.70077082 1.01123596 0.63061271 0.55234856 0.69008186 0.62772842]
 [0.49903347 0.63061271 1.01123596 0.38913406 0.59050396 0.43238784]
 [0.9291323  0.55234856 0.38913406 1.01123596 0.52844526 0.79249429]
 [0.77842677 0.69008186 0.59050396 0.52844526 1.01123596 0.77632221]
 [0.84453142 0.62772842 0.43238784 0.79249429 0.77632221 1.01123596]]
Note: the diagonal entries are 90/89 ≈ 1.0112 rather than exactly 1 because np.cov divides by n-1 (ddof=1) while zscore standardized with ddof=0.
from sklearn.decomposition import PCA
pca = PCA(n_components=6)
pca.fit(df_BatsmRank_pca)
PCA(n_components=6)
# Eigen values
print(pca.explained_variance_)
[4.30252561 0.83636692 0.41665751 0.32912443 0.16567829 0.01706297]
# Eigen vectors
print(pca.components_)
[[ 0.4582608   0.39797313  0.3253838   0.40574167  0.41733459  0.43237178]
 [ 0.26643209 -0.33111756 -0.69780334  0.47355804 -0.17902455  0.27593225]
 [-0.10977942  0.00550486 -0.45013448 -0.50823538  0.66942589  0.28082541]
 [-0.00520142  0.84736307 -0.43275029 -0.03252305 -0.24878157 -0.17811777]
 [ 0.45840889 -0.10122837 -0.11890348  0.09676885  0.39458014 -0.77486668]
 [ 0.70483594 -0.0606373   0.05624934 -0.58514214 -0.35786211  0.16096217]]
# Explained variance ratio: share of total variance captured by each component
print(pca.explained_variance_ratio_)
[0.70911996 0.13784566 0.06867133 0.05424458 0.02730624 0.00281223]
plt.bar(list(range(1, 7)), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Eigenvalue')
plt.show()
plt.step(list(range(1, 7)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('Eigenvalue')
plt.show()
pca4 = PCA(n_components=4)
pca4.fit(df_BatsmRank_pca)
Xpca4 = pca4.transform(df_BatsmRank_pca)
Xpca4.shape
PCA(n_components=4)
(90, 4)
# Eigenvalues and explained-variance ratios (these are ratios, not eigenvectors)
Values = pd.DataFrame(pca4.explained_variance_)
Ratios = pd.DataFrame(pca4.explained_variance_ratio_)
# Weight for each component: eigenvalue * explained-variance ratio
Score = Values * Ratios
print(Score)
Score.shape
          0
0  3.051007
1  0.115290
2  0.028612
3  0.017853
(4, 1)
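The ranking idea above collapses the four principal-component scores of each player into a single number by weighting each component before summing. A minimal sketch of the same idea on synthetic data; for simplicity it weights by the explained-variance ratio alone, whereas the cell above additionally multiplies by the eigenvalues:

```python
# Project rows onto the retained components, then collapse to one score per
# row by weighting each component by its explained-variance ratio.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 6))                 # stands in for the z-scored stats
pca = PCA(n_components=4).fit(X)
scores = pca.transform(X) @ pca.explained_variance_ratio_  # shape (90,)
ranking = np.argsort(scores)[::-1]           # row indices, highest score first
print(scores.shape, ranking[:5])
```

Because the weights scale each component but preserve sign, rows that sit far along the dominant components rank highest.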
Final_Score = pd.DataFrame(Xpca4.dot(Score));
NameScore = Final_Score.join(Names)
NameScore.head()
| 0 | Name | |
|---|---|---|
| 0 | 26.031149 | CH Gayle |
| 1 | 14.235813 | G Gambhir |
| 2 | 12.656784 | V Sehwag |
| 3 | 11.905379 | CL White |
| 4 | 12.728287 | S Dhawan |
# Sort players with highest ranking first
NameScore.sort_values(0,axis=0,ascending=False)
| 0 | Name | |
|---|---|---|
| 0 | 26.031149 | CH Gayle |
| 1 | 14.235813 | G Gambhir |
| 4 | 12.728287 | S Dhawan |
| 2 | 12.656784 | V Sehwag |
| 5 | 12.479462 | AM Rahane |
| ... | ... | ... |
| 86 | -9.010853 | WD Parnell |
| 85 | -9.035490 | Z Khan |
| 87 | -9.169080 | PC Valthaty |
| 88 | -10.212834 | RP Singh |
| 89 | -11.675565 | R Sharma |
90 rows × 2 columns
Question: List down the dimensionality reduction techniques that can be implemented using Python.
Answer:
Feature projection based: PCA, Kernel PCA, Truncated SVD, Linear Discriminant Analysis (LDA), Factor Analysis, Independent Component Analysis (ICA), t-SNE, UMAP, autoencoders.
Feature selection based: variance thresholding, univariate selection (SelectKBest), recursive feature elimination (RFE), model-based selection via Lasso or tree feature importances, forward/backward sequential selection.
So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings using a simple implementation in Python.
Answer: Yes, it is possible to apply dimensionality reduction techniques to multimedia and text data.
# Consider an image dataset of 8x8 images of handwritten digits
from sklearn.datasets import load_digits
# Import the Factor Analysis module
from sklearn.decomposition import FactorAnalysis
# Load the digits data and check its shape
X, _ = load_digits(return_X_y=True)
print("Original size", X.shape)
# Use Factor Analysis to reduce the dimensionality from 64 to 7
transformer = FactorAnalysis(n_components=7, random_state=0)
X_transformed = transformer.fit_transform(X)
print("Transformed size", X_transformed.shape)
Original size (1797, 64) Transformed size (1797, 7)
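For text data, the same pattern applies: vectorize documents into a high-dimensional sparse matrix, then reduce it. A minimal sketch using TF-IDF followed by TruncatedSVD (this combination is known as latent semantic analysis, LSA); the documents below are made-up examples:

```python
# Reduce TF-IDF text vectors with TruncatedSVD (latent semantic analysis).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "fuel economy of city cars",
        "miles per gallon in city driving"]
X_tfidf = TfidfVectorizer().fit_transform(docs)   # sparse (n_docs, n_terms)
print("Original size", X_tfidf.shape)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)
print("Transformed size", X_lsa.shape)            # (4, 2)
```

TruncatedSVD is preferred over PCA here because it works directly on sparse matrices without centering them.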
!jupyter nbconvert --to='html' '/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Unsupervised_Learning.ipynb'
[NbConvertApp] WARNING | pattern '/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Unsupervised_Learning.ipynb' matched no files